A search engine is a software system that helps users find relevant information from a large collection of data by processing queries and returning matching results.
DSA Search Engine
It typically works by indexing content (such as web pages or documents), allowing users to perform keyword-based searches, and ranking the results based on relevance.
Popular real-world examples include Google, Bing, and Elasticsearch.
In this chapter, we will explore the low-level design of a simple in-memory search engine.
Let's start by clarifying the requirements:
Before starting the design, it's important to ask thoughtful questions to uncover hidden assumptions, clarify ambiguities, and define the system's scope more precisely.
Here is an example of how a conversation between the candidate and the interviewer might unfold:
Candidate: Implementing web crawling can add significant complexity. Should we preload documents or web pages into the system?
Interviewer: For this version, assume a predefined set of documents or web pages is already available in memory. No need to implement crawling.
Candidate: Should the search engine support only keyword-based search, or also handle phrases queries and logical operators?
Interviewer: Keep it simple for now. Basic keyword-based search is sufficient.
Candidate: Should the system return only exact matches, or also support partial and fuzzy matches?
Interviewer: Let's support exact matches only for now. You can assume case-insensitive search.
Candidate: Do we need to rank the results, or is returning any matching document enough?
Interviewer: Basic scoring and ranking should be implemented (e.g., based on the frequency of the keyword within each document).
Candidate: Should we include text processing techniques like stop-word removal or stemming during indexing and querying?
Interviewer: Not for this version. Treat all words equally. No stop-word removal or stemming.
Candidate: Should we allow users to input search queries dynamically, or can we hardcode a set of search queries?
Interviewer: For this version, assume queries are predefined and supplied via code. No need to handle runtime user input.
After gathering the details, we can summarize the key system requirements.
Core entities are the fundamental building blocks of our system. We identify them by analyzing the functional requirements and mapping the key responsibilities to object-oriented abstractions—classes, interfaces, or enums.
Let’s walk through the functional requirements and extract the relevant entities:
This indicates the need for a Document entity to represent each searchable item. Each document should have a unique identifier and raw text content.
To manage all available documents, we introduce a DocumentStore entity. This serves as an in-memory container that exposes APIs to add and retrieve documents.
For efficient keyword-based retrieval, we require an InvertedIndex—a well-known data structure in search engines. It maps terms (keywords) to the list of documents that contain them, along with metadata such as frequency.
To support this, we define a Posting entity that represents an occurrence of a term in a document. Each posting includes:
In addition, we introduce a SearchResult entity that packages:
DocumentTo orchestrate the entire search pipeline, we define a SearchEngine entity.
Document: Represents a searchable item with fields like ID, title, and content.DocumentStore: Maintains all documents in memory and provides retrieval methods.InvertedIndex: Core data structure mapping terms to their document postings.Posting: Represents the occurrence of a term in a document (document ID, frequency, etc.).SearchResult: Represents a matched document along with metadata and relevance score.SearchEngine: Coordinates the search process, from indexing and query parsing to retrieval.These core entities define the essential abstractions of the in-memory search engine and will guide the structure of your low-level design and class diagrams.
This section outlines the classes that form the building blocks of the search engine, their responsibilities, and the relationships between them.
The system is designed with a clear separation of concerns, categorized into data classes that hold information and core classes that implement the engine's logic.
These are simple Plain Old Java Objects (POJOs) or data containers with minimal logic.
DocumentRepresents a single unit of information to be indexed and searched. It contains a unique id, a title, and its content.
Posting An entry in the inverted index.
It encapsulates the documentId where a term appears and the frequency of that term within the document.
SearchResultA container that pairs a Document with its calculated relevance score, used for ranking and display.
These classes contain the main business logic for indexing, searching, scoring, and ranking.
DocumentStoreActs as an in-memory database, mapping document IDs to Document objects for quick retrieval.
InvertedIndexThe central data structure of the engine.
It maps each term (word) to a list of Posting objects, enabling fast lookups of documents containing a specific term.
A concrete implementation of RankingStrategy that sorts by score, using the document title alphabetically as a tie-breaker.
SearchEngineThe main orchestrator class.
It provides a simple public API for indexing documents and performing searches, hiding the underlying complexity of the system.
The classes interact through a combination of composition, association, and dependency, creating a robust and flexible system.
The SearchEngine has a strong "owns-a" relationship with its core components.
SearchEngine ◆── InvertedIndex: The SearchEngine creates and manages the lifecycle of its InvertedIndex. The index cannot exist without the engine.SearchEngine ◆── DocumentStore: Similarly, the DocumentStore is an integral part of the SearchEngine and is managed by it.The index and store have "has-a" relationships with their data objects.
InvertedIndex ◇── Posting: The InvertedIndex contains a map of terms to lists of Posting objects. The postings are part of the index but represent data linked to independent documents.DocumentStore ◇── Document: The DocumentStore holds a collection of Document objects, which are created externally and added to the store.The SearchEngine has a "uses-a" relationship with its strategies.
SearchEngine → ScoringStrategy: The SearchEngine holds a reference to a ScoringStrategy object. This allows the scoring algorithm to be changed dynamically (pluggable behavior).SearchEngine → RankingStrategy: The SearchEngine also holds a reference to a RankingStrategy, allowing the ranking logic to be easily swapped.SearchResult → Document: Each SearchResult is associated with the Document it represents.Several classes depend on others to perform their tasks, often as method parameters.
SearchEngine depends on Document for indexing and SearchResult for returning search results.ScoringStrategy implementations depend on Posting and Document to calculate a score.Several design patterns are employed to ensure the system is efficient, scalable, and maintainable.
This pattern is used to make the scoring and ranking algorithms interchangeable.
ScoringStrategyThe ScoringStrategy interface and its concrete implementations (TermFrequencyScoringStrategy, TitleBoostScoringStrategy) allow the client to choose how documents are scored without modifying the SearchEngine.
RankingStrategyLikewise, the RankingStrategy interface and its implementations allow the sorting logic for results to be defined and selected at runtime.
The SearchEngine class acts as a Facade. It provides a simplified, high-level interface (indexDocuments, search) to the more complex underlying subsystem of indexing, data storage, scoring, and ranking. This decouples the client from the internal workings of the search engine.
The SearchEngine is implemented as a Singleton to ensure there is only one instance managing the index and document store for the entire application. This provides a single, global point of access and prevents inconsistencies from multiple competing instances.
Represents a unit of information indexed by the search engine.
1class Document:
2 def __init__(self, id: str, title: str, content: str):
3 self.id = id
4 self.title = title
5 self.content = content
6
7 def get_id(self) -> str:
8 return self.id
9
10 def get_title(self) -> str:
11 return self.title
12
13 def get_content(self) -> str:
14 return self.content
15
16 def __str__(self) -> str:
17 return f"Document(id={self.id}, title='{self.title}')"Each document has:
idtitle and content for search and scoringActs as an in-memory database for documents.
1class DocumentStore:
2 def __init__(self):
3 self.store: Dict[str, Document] = {}
4
5 def add_document(self, doc: Document):
6 self.store[doc.get_id()] = doc
7
8 def get_document(self, doc_id: str) -> Optional[Document]:
9 return self.store.get(doc_id)Supports retrieval by ID during scoring and search result generation.
Encapsulates term-specific metadata within a document. Used as entries in the inverted index.
1class Posting:
2 def __init__(self, document_id: str, frequency: int):
3 self.document_id = document_id
4 self.frequency = frequency
5
6 def get_document_id(self) -> str:
7 return self.document_id
8
9 def get_frequency(self) -> int:
10 return self.frequencydocumentId: The document where the term appearsfrequency: How often the term occurs (used for scoring)Maps each term to a list of Postings.
1class InvertedIndex:
2 def __init__(self):
3 self.index: Dict[str, List[Posting]] = defaultdict(list)
4
5 def add(self, term: str, document_id: str, frequency: int):
6 postings = self.index.get(term, [])
7 postings.append(Posting(document_id, frequency))
8 self.index[term] = postings
9
10 def get_postings(self, term: str) -> List[Posting]:
11 return self.index.get(term, [])This is the heart of the search engine that enables fast lookup of documents containing a query term.
An inverted index is the fundamental data structure that makes search engines fast. Instead of scanning every document for a query term (which would be very slow), we pre-process the documents and build a map from each term (word) to a list of documents that contain it.
Pairs a document with its calculated relevance score. Used for ranking and presenting the final search results to the user.
1class SearchResult:
2 def __init__(self, document: Document, score: float):
3 self.document = document
4 self.score = score
5
6 def get_document(self) -> Document:
7 return self.document
8
9 def get_score(self) -> float:
10 return self.score
11
12 def __str__(self) -> str:
13 return f" - {self.document.get_title()} (Score: {self.score:.2f})"Implements the Strategy pattern for scoring.
1class ScoringStrategy(ABC):
2 @abstractmethod
3 def calculate_score(self, term: str, posting: Posting, document: Document) -> float:
4 pass
5
6class TermFrequencyScoringStrategy(ScoringStrategy):
7 def calculate_score(self, term: str, posting: Posting, document: Document) -> float:
8 return posting.get_frequency()
9
10class TitleBoostScoringStrategy(ScoringStrategy):
11 TITLE_BOOST_FACTOR = 1.5
12
13 def calculate_score(self, term: str, posting: Posting, document: Document) -> float:
14 score = posting.get_frequency()
15 if term in document.get_title().lower():
16 score *= self.TITLE_BOOST_FACTOR
17 return scoreThe ScoringStrategy interface defines a contract for any scoring algorithm. The SearchEngine holds a reference to an object of this type. This allows us to easily switch between a simple TermFrequencyScoringStrategy and a more advanced TitleBoostScoringStrategy at runtime.
Implements the Strategy pattern for ranking.
1class RankingStrategy(ABC):
2 @abstractmethod
3 def rank(self, results: List[SearchResult]):
4 pass
5
6class ScoreBasedRankingStrategy(RankingStrategy):
7 def rank(self, results: List[SearchResult]):
8 results.sort(key=lambda x: x.get_score(), reverse=True)
9
10class ScoreThenAlphabeticalRankingStrategy(RankingStrategy):
11 def rank(self, results: List[SearchResult]):
12 results.sort(key=lambda x: (-x.get_score(), x.get_document().get_title()))Similar to scoring, the RankingStrategy allows us to define different ways to order the final results. The ScoreBasedRankingStrategy provides a standard relevance sort, while the ScoreThenAlphabeticalRankingStrategy shows how we can handle tie-breaking gracefully, a common requirement in real-world systems.
This class acts as a central Facade and Singleton, orchestrating all the components to provide a simple API for indexing and searching.
1class SearchEngine:
2 _instance = None
3
4 def __new__(cls):
5 if cls._instance is None:
6 cls._instance = super().__new__(cls)
7 return cls._instance
8
9 def __init__(self):
10 if hasattr(self, '_initialized'):
11 return
12 self._initialized = True
13 self.inverted_index = InvertedIndex()
14 self.document_store = DocumentStore()
15 self.scoring_strategy = None
16 self.ranking_strategy = None
17
18 @classmethod
19 def get_instance(cls):
20 if cls._instance is None:
21 cls._instance = cls()
22 return cls._instance
23
24 def set_scoring_strategy(self, scoring_strategy: ScoringStrategy):
25 self.scoring_strategy = scoring_strategy
26
27 def set_ranking_strategy(self, ranking_strategy: RankingStrategy):
28 self.ranking_strategy = ranking_strategy
29
30 def index_documents(self, documents: List[Document]):
31 for doc in documents:
32 self.index_document(doc)
33
34 def index_document(self, doc: Document):
35 self.document_store.add_document(doc)
36 term_frequencies: Dict[str, int] = {}
37
38 text = (doc.get_title() + " " + doc.get_content()).lower()
39 tokens = re.split(r'\W+', text)
40
41 for token in tokens:
42 if token:
43 term_frequencies[token] = term_frequencies.get(token, 0) + 1
44
45 for term, frequency in term_frequencies.items():
46 self.inverted_index.add(term, doc.get_id(), frequency)
47
48 def search(self, query: str) -> List[SearchResult]:
49 processed_query = query.lower()
50
51 postings = self.inverted_index.get_postings(processed_query)
52
53 results = []
54 for posting in postings:
55 doc = self.document_store.get_document(posting.get_document_id())
56 if doc is not None:
57 score = self.scoring_strategy.calculate_score(processed_query, posting, doc)
58 results.append(SearchResult(doc, score))
59
60 self.ranking_strategy.rank(results)
61
62 return resultsThis driver class shows how a client would interact with the SearchEngine and demonstrates the flexibility of the strategy-based design.
1class SearchEngineDemo:
2 @staticmethod
3 def main():
4 engine = SearchEngine.get_instance()
5
6 documents = [
7 Document("doc1", "Java Performance", "Java is a high-performance language. Tuning Java applications is key."),
8 Document("doc2", "Introduction to Python", "Python is a versatile language, great for beginners."),
9 Document("doc3", "Advanced Java Concepts", "This document covers advanced topics in Java programming."),
10 Document("doc4", "Python vs. Java", "A document comparing Python and Java for web development. Java is faster.")
11 ]
12
13 print("Indexing documents...")
14 engine.index_documents(documents)
15 print("Indexing complete.\n")
16
17 print("====== TermFrequency Scoring + ScoreBased Ranking ======")
18 engine.set_scoring_strategy(TermFrequencyScoringStrategy())
19 engine.set_ranking_strategy(ScoreBasedRankingStrategy())
20
21 SearchEngineDemo.perform_search(engine, "java")
22 SearchEngineDemo.perform_search(engine, "language")
23 SearchEngineDemo.perform_search(engine, "performance")
24
25 print("\n====== TitleBoost Scoring + Score-then-Alphabetical Ranking ======")
26 engine.set_scoring_strategy(TitleBoostScoringStrategy())
27 engine.set_ranking_strategy(ScoreThenAlphabeticalRankingStrategy())
28
29 SearchEngineDemo.perform_search(engine, "java")
30 SearchEngineDemo.perform_search(engine, "language")
31 SearchEngineDemo.perform_search(engine, "performance")
32
33 SearchEngineDemo.perform_search(engine, "paint")
34
35 @staticmethod
36 def perform_search(engine: SearchEngine, query: str):
37 print(f"--- Searching for: '{query}' ---")
38 results = engine.search(query)
39
40 if not results:
41 print(" No results found.")
42 else:
43 for i, result in enumerate(results):
44 print(f"Rank {i + 1}:{result}")
45 print()
46
47
48if __name__ == "__main__":
49 SearchEngineDemo.main()Which entity is responsible for representing a single piece of content that can be searched in the simple search engine design?
No comments yet. Be the first to comment!